Identifying errors in sequence alignment to improve protein comparative modelling
Abstract
The gap between the number of known protein sequences and the number of known protein structures is vast, and comparative modelling offers a way to bridge it. Misalignment between target and parent is the largest cause of error in comparative modelling. We define SSMAs (Sequence-Structure MisAlignments) as regions where the sequence and structural alignments do not agree. We find that most SSMAs are short (< 10 residues) and that there is a strong preference for an SSMA to start and finish in an unstructured region or a turn. Neural networks were trained to identify regions of sequence likely to be misaligned, first using single sequences to predict the ‘alignability’ of homologues with ≤ 35% sequence identity, and then combining the single-sequence predictions to predict SSMAs in an alignment of two sequences. Predictions of SSMAs in single sequences had positive predictive values of up to 89.1% (MCC = 0.798), while the alignment predictions had positive predictive values of 92.9% (MCC = 0.648). In combination with a program to permute alignments, these networks were applied to comparative modelling of sequences previously submitted to CASP5. The average RMSD of these models improved by some 37%, illustrating that the method is likely to be extremely valuable in improving alignments for comparative modelling.
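The SSMA definition and the two reported metrics can be made concrete with a minimal Python sketch. It is an illustration only, not the authors' implementation: the function names (`aligned_pairs`, `find_ssmas`, `ppv`, `mcc`), the gapped-string input format, and the toy sequences are all assumptions. An SSMA is taken here to be a run of target residues whose parent partner differs between the sequence alignment and the structural alignment.

```python
from math import sqrt


def aligned_pairs(target_row, parent_row):
    """Map each target residue index to its parent residue index.

    Rows are gapped strings of equal length ('-' marks a gap).
    A target residue aligned to a gap maps to None.
    """
    if len(target_row) != len(parent_row):
        raise ValueError("alignment rows must have equal length")
    pairs = {}
    ti = pi = 0  # running residue indices in target and parent
    for t, p in zip(target_row, parent_row):
        if t != "-":
            pairs[ti] = pi if p != "-" else None
            ti += 1
        if p != "-":
            pi += 1
    return pairs


def find_ssmas(seq_aln, struct_aln):
    """Return (start, end) target-residue ranges where the two alignments disagree.

    Each argument is a (target_row, parent_row) tuple (hypothetical format).
    """
    seq_pairs = aligned_pairs(*seq_aln)
    str_pairs = aligned_pairs(*struct_aln)
    if len(seq_pairs) != len(str_pairs):
        raise ValueError("both alignments must contain the same target sequence")
    regions, start = [], None
    n = len(seq_pairs)
    for i in range(n):
        mismatch = seq_pairs[i] != str_pairs[i]
        if mismatch and start is None:
            start = i                       # a disagreeing run begins
        elif not mismatch and start is not None:
            regions.append((start, i - 1))  # the run ended at the previous residue
            start = None
    if start is not None:
        regions.append((start, n - 1))
    return regions


def ppv(tp, fp):
    """Positive predictive value: TP / (TP + FP)."""
    return tp / (tp + fp)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy data (invented for illustration): the two alignments place the gap differently.
sequence_alignment = ("ACDEFGHIK", "ACDF-GHIK")
structural_alignment = ("ACDEFGHIK", "ACD-FGHIK")
ssmas = find_ssmas(sequence_alignment, structural_alignment)
```

In this toy case the two alignments disagree only around the gap, so a single short SSMA covering target residues 3 to 4 (zero-based) is reported. The `ppv` and `mcc` helpers show how the two statistics quoted in the abstract are computed from confusion-matrix counts; the counts themselves are not given in the abstract.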
Similar resources
Pattern-induced multi-sequence alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for use in comparative protein modelling.
A multiple sequence alignment algorithm is described that uses a dynamic programming-based pattern construction method to align a set of homologous sequences based on their common pattern of conserved sequence elements. This pattern-induced multi-sequence alignment (PIMA) algorithm can employ secondary-structure dependent gap penalties for use in comparative modelling of new sequences when the ...
In silico protein recombination: enhancing template and sequence alignment selection for comparative protein modelling.
Comparative modelling of proteins is a predictive technique to build an atomic model for a given amino acid sequence, on the basis of the structures of other proteins (templates) that have been determined experimentally. Critical problems arise in this procedure: selecting the correct templates, aligning the query sequence with them and building the non-conserved surface loops. In this work, we...
Methods for High-Throughput Comparative Genomics and Distributed Sequence Analysis
Title of Document: METHODS FOR HIGH-THROUGHPUT COMPARATIVE GENOMICS AND DISTRIBUTED SEQUENCE ANALYSIS. Samuel Vincent Angiuoli, Ph.D., 2011. Directed by: Professor S.L. Salzberg, Department of Computer Science. High-throughput sequencing has accelerated applications of genomics throughout the world. The increased production and decentralization of sequencing has also created bottlenecks in computa...
Protein Sequence Alignment Analysis by Local Covariation: Coevolution Statistics Detect Benchmark Alignment Errors
The use of sequence alignments to understand protein families is ubiquitous in molecular biology. High quality alignments are difficult to build and protein alignment remains one of the largest open problems in computational biology. Misalignments can lead to inferential errors about protein structure, folding, function, phylogeny, and residue importance. Identifying alignment errors is difficu...
Algorithms for Protein Comparative Modelling and Some Evolutionary Implications
Protein comparative modelling (CM) is a predictive technique to build an atomic model for a polypeptide chain, based on the experimentally determined structures of related proteins (templates). It is widely used in Structural Biology, with applications ranging from mutation analysis, protein and drug design to function prediction and analysis, particularly when there are no experimental structu...